*Input Data
import excel "$root/Data/Original/cufa_report.xlsx", firstrow clear

*Destring the relevant variables
destring Stammnummer, force gen(VID)
destring Kilometerstand, force gen(KM_driven)
drop Stammnummer Kilometerstand 


*get year out of date
split Prüfdatum, p(" ")
split Prüfdatum1, p(-)

destring Prüfdatum11, g(year)
destring Prüfdatum12, g(month)
destring Prüfdatum13, g(day)

*There are some data related troubles:
*drop observations with missing or no ID
drop if VID==.
drop if VID==0

*drop observations with unrealistic driven KM's (assume higher than 500'000 to be unrealistic for a private car)
drop if KM_driven>800000
drop if KM_driven<0



*Decide how to handle duplicates next and then merge over VID with the vehicle data.
collapse (max) KM_driven if KM_driven!=., by(VID year)

sort VID year


collapse (max) max_year=year (min) min_year=year (firstnm) KM_first=KM_driven (lastnm) KM_last=KM_driven (count) nr_year=year , by(VID)


gen year_diff=max_year-min_year
gen KM_diff=KM_last-KM_first

tab year_diff
su KM_diff
tab nr_year


*After inspecting the data, we decide to keep the lastnm observation of each VID. The sometimes big differences between observations of VID's can have several explanations, i.e. a VID was reassigned after a car was destroyed,
*data gathering has mistakes, i.e. wrong data input. 
keep max_year VID KM_last

rename max_year inspect_year
rename KM_last KM_driven

*

save "$root/Data/Original/KM_driven.dta", replace









